Before introducing the topic that will concern us, we completely clean the workspace.
rm(list=ls())
Imagine a big-data company looking for a new data scientist: once they have selected the most promising profiles, they want to distinguish which candidates will be most profitable for them in the long run.
Logically, they want to reduce training costs by hiring a person who is more likely to stay in the company for a long time.
HR Analytics: Job Change of Data Scientists is a dataset that uses credentials, demographics, education, experience and other social considerations to predict which workers are more likely to stay in the company or leave to accept another position.
Moreover, thanks to this data we will be able to learn other facts, such as which aspects matter most for an employee to keep their data scientist position or drop out. The database is ready to download at this link
We are going to load all libraries and set the working directory.
library(xfun)
library(rmarkdown)
library(factoextra)
library(dplyr)
library(ggplot2)
library(cluster)
library(dendextend)
library(psych)
library(tidyverse)
library(tidyr)
library(kernlab)
library(mclust)
library(NbClust)
library(GGally)
library(Amelia)
library(corrplot)
We will treat train_data (14 columns) as our full dataset, because if we merged it with test_data (13 columns), one important column (target) would be missing for part of the observations.
For our purpose, I consider the information provided by this variable more valuable than the few, non-characteristic observations in test_data.
setwd("C:/Users/lucia/OneDrive/Documentos/Universidad/Segundo curso/Aprendizaje")
data = read.csv("aug_train.csv",header=T, na.strings="")
head(data)
## enrollee_id city city_development_index gender relevent_experience
## 1 8949 city_103 0.920 Male Has relevent experience
## 2 29725 city_40 0.776 Male No relevent experience
## 3 11561 city_21 0.624 <NA> No relevent experience
## 4 33241 city_115 0.789 <NA> No relevent experience
## 5 666 city_162 0.767 Male Has relevent experience
## 6 21651 city_176 0.764 <NA> Has relevent experience
## enrolled_university education_level major_discipline experience company_size
## 1 no_enrollment Graduate STEM >20 <NA>
## 2 no_enrollment Graduate STEM 15 50-99
## 3 Full time course Graduate STEM 5 <NA>
## 4 <NA> Graduate Business Degree <1 <NA>
## 5 no_enrollment Masters STEM >20 50-99
## 6 Part time course Graduate STEM 11 <NA>
## company_type last_new_job training_hours target
## 1 <NA> 1 36 1
## 2 Pvt Ltd >4 47 0
## 3 <NA> never 83 0
## 4 Pvt Ltd never 52 1
## 5 Funded Startup 4 8 0
## 6 <NA> 1 24 1
First, let's look at the structure of the data, analyse the formats and deal with the NAs.
str(data)
## 'data.frame': 19158 obs. of 14 variables:
## $ enrollee_id : int 8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
## $ city : chr "city_103" "city_40" "city_21" "city_115" ...
## $ city_development_index: num 0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
## $ gender : chr "Male" "Male" NA NA ...
## $ relevent_experience : chr "Has relevent experience" "No relevent experience" "No relevent experience" "No relevent experience" ...
## $ enrolled_university : chr "no_enrollment" "no_enrollment" "Full time course" NA ...
## $ education_level : chr "Graduate" "Graduate" "Graduate" "Graduate" ...
## $ major_discipline : chr "STEM" "STEM" "STEM" "Business Degree" ...
## $ experience : chr ">20" "15" "5" "<1" ...
## $ company_size : chr NA "50-99" NA NA ...
## $ company_type : chr NA "Pvt Ltd" NA "Pvt Ltd" ...
## $ last_new_job : chr "1" ">4" "never" "never" ...
## $ training_hours : int 36 47 83 52 8 24 24 18 46 123 ...
## $ target : num 1 0 0 1 0 1 0 1 1 0 ...
Many variables are of character type, which we cannot work with directly later. For this reason, we change them to factors, and later to numeric (unless the value is categorical rather than numeric).
data$city = as.factor(data$city)
data$gender = as.factor(data$gender)
data$relevent_experience = as.factor(data$relevent_experience)
data$enrolled_university = as.factor(data$enrolled_university)
data$education_level = as.factor(data$education_level)
data$major_discipline = as.factor(data$major_discipline)
data$experience = as.factor(data$experience)
data$company_size = as.factor(data$company_size)
data$last_new_job = as.factor(data$last_new_job)
data$training_hours = as.integer(data$training_hours)
data$target = as.factor(data$target)
str(data)
## 'data.frame': 19158 obs. of 14 variables:
## $ enrollee_id : int 8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
## $ city : Factor w/ 123 levels "city_1","city_10",..: 6 78 65 15 51 58 50 84 6 6 ...
## $ city_development_index: num 0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 2 NA NA 2 NA 2 2 2 NA ...
## $ relevent_experience : Factor w/ 2 levels "Has relevent experience",..: 1 2 2 2 1 1 1 1 1 1 ...
## $ enrolled_university : Factor w/ 3 levels "Full time course",..: 2 2 1 NA 2 3 2 2 2 2 ...
## $ education_level : Factor w/ 5 levels "Graduate","High School",..: 1 1 1 1 3 1 2 1 1 1 ...
## $ major_discipline : Factor w/ 6 levels "Arts","Business Degree",..: 6 6 6 2 6 6 NA 6 6 6 ...
## $ experience : Factor w/ 22 levels "<1",">20","1",..: 2 9 18 1 2 5 18 7 20 11 ...
## $ company_size : Factor w/ 8 levels "<10","10/49",..: NA 6 NA NA 6 NA 6 1 6 5 ...
## $ company_type : chr NA "Pvt Ltd" NA "Pvt Ltd" ...
## $ last_new_job : Factor w/ 6 levels ">4","1","2","3",..: 2 1 6 6 5 2 2 1 2 1 ...
## $ training_hours : int 36 47 83 52 8 24 24 18 46 123 ...
## $ target : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 1 2 2 1 ...
sum(is.na(data))
## [1] 20733
In company_size some rows contain the value 10/49, which does not match the format of the other company sizes, so I will start cleaning this data by changing the "/" sign to "-" (10/49 becomes 10-49).
Afterwards we check that no "10/49" values remain.
data$company_size = gsub("/","-",data$company_size) # note: gsub() coerces the factor back to character
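The substitution can be sanity-checked on a toy vector (the values here are illustrative, not the full column):

```r
# Toy check: no slash-formatted sizes remain after gsub()
sizes <- c("<10", "10/49", "50-99", "10/49", "100-500")
fixed <- gsub("/", "-", sizes, fixed = TRUE)
any(grepl("/", fixed, fixed = TRUE))  # FALSE: every "10/49" became "10-49"
```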
On the other hand, there are about 20,000 NAs. This many missing values would hurt our models, so let's see exactly where they are located (numerically and graphically) and fix them, starting with the mice function.
null_df = data.frame(variable = names(data),
                     total_null = colSums(is.na(data)),
                     row.names = NULL)
null_df
## variable total_null
## 1 enrollee_id 0
## 2 city 0
## 3 city_development_index 0
## 4 gender 4508
## 5 relevent_experience 0
## 6 enrolled_university 386
## 7 education_level 460
## 8 major_discipline 2813
## 9 experience 65
## 10 company_size 5938
## 11 company_type 6140
## 12 last_new_job 423
## 13 training_hours 0
## 14 target 0
Data visualization: NAs by column
null_df$variable = factor(null_df$variable,
levels = null_df$variable[order(null_df$total_null,decreasing = TRUE)])
ggplot(null_df, aes(x = variable, y = total_null))+
  geom_bar(stat = "identity", col = "#FFD972", fill = "#09BC8A")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+
  geom_label(aes(label = total_null, size = NULL), nudge_y = 0.7)+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("Total NAs by col")+
  xlab("Variable")+
  ylab("Total NAs")
missing = is.na(data)
sum(missing)
## [1] 20733
missmap(data, main = 'Missing Map', col = c('#58A4B0', '#373F51'))
To fix the null values we have many tools: removing them outright, imputing with the mean, the median or the mode, predicting them… First, I tried the mice function.
mice creates imputations (replacements for the missing values) for multivariate missing data.
#imp = mice(data, m = 5,
#           method = c("","","","pmm","pmm","pmm","pmm","pmm","pmm","pmm","pmm","pmm","",""),
#           maxit = 20)
#final_data = complete(imp, 5)
#summary(final_data)
The main issue is that it is very time- and compute-intensive, so I discarded this option and instead computed the most frequent level of each variable, assigning it to that variable's NAs.
For example, men are the predominant gender, so if a gender cell is empty we may assume the person is more likely to be a man. In the Exploratory Data Analysis part (right after this one) we will explore these relations of predominance.
For now, we will use those conclusions and, in the next point, show them graphically:
data$city_development_index[which(is.na(data$city_development_index))] = mean(data$city_development_index, na.rm = TRUE)
data$gender[which(is.na(data$gender))] = "Male"
data$company_size[which(is.na(data$company_size))] = "50-99"
data$relevent_experience[which(is.na(data$relevent_experience))] = "No relevent experience"
data$enrolled_university[which(is.na(data$enrolled_university))] = "no_enrollment"
data$education_level[which(is.na(data$education_level))] = "Graduate"
data$major_discipline[which(is.na(data$major_discipline))] = "STEM"
data$experience[which(is.na(data$experience))] = ">20"
data$last_new_job[which(is.na(data$last_new_job))] = 1
data$training_hours[which(is.na(data$training_hours))] = mean(data$training_hours, na.rm = TRUE)
sum(data$target == 0)
## [1] 14381
sum(data$target == 1)
## [1] 4777
data$target[which(is.na(data$target))] = 0
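The per-column assignments above can also be written once as a helper; a minimal sketch with a hypothetical `impute_mode` function (not part of the original code):

```r
# Hypothetical helper: fill a factor's NAs with its most frequent level
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() ignores NAs by default
  x[is.na(x)] <- mode_level
  x
}

gender <- factor(c("Male", "Male", NA, "Female", NA))
sum(is.na(impute_mode(gender)))  # 0: both NAs become "Male"
```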
Now, let's consider the importance of the variables: I will remove city (without coordinates, the city label alone is not useful), company type and enrollee ID, now that we know there are no repeated values for the latter.
We will work with the 11 remaining variables and check that no NAs are left.
data = select(data,-enrollee_id, -city,-company_type)
sum(is.na(data))
## [1] 0
missing = is.na(data)
sum(missing)
## [1] 0
missmap(data, main = 'Missing Map', col = c('#58A4B0', '#373F51'))
Although many profiles may share the same characteristics, the ID number is unique for each of them. In this case, we conclude there are no duplicates.
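The duplicate check itself is a single call, shown here on a toy ID vector (enrollee_id itself is dropped in the step above):

```r
# anyDuplicated() returns 0 when every ID appears exactly once
ids <- c(8949, 29725, 11561, 33241, 666)
anyDuplicated(ids)  # 0: no duplicated IDs
```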
By doing some EDA work (note that the statistics exclude the NAs), we discover that: * About 90% of the people are men.
* Over 70% of all candidates have relevant experience, with a trend where relevant experience outweighs non-relevant experience among candidates who attended college. Overall, among candidates who did not go to college, non-relevant experience is more common than relevant experience.
* 73% of candidates were not enrolled in university by the time the data were collected, and the other 27% are divided between full- and part-time courses.
* Roughly 86% of candidates went to college, and 60% of them hold at most a bachelor's degree; 22.7% hold a Master's degree, and 2.7% have a PhD.
* Most of these studies are STEM-related (75%); humanities account for 3.49% and business for 1.7%.
* Unlike other variables, company size is fairly evenly distributed, but there is a slight tendency for candidates to work in small companies.
* The intervals of 20+ years and 2 to 7 years of experience are the most frequent. Candidates with 20+ years of experience are the most likely to have relevant experience.
* Almost 1 out of 2 candidates have a one-year gap between their current and previous jobs, while over 17% have a gap of 4+ years.
* Most of the candidates come from cities with a development index of 0.9 (highly developed).
The second most common interval is 0.6 ~ 0.7. A city development index of 0.9+ is the majority for target 0's, and 0.6 for target 1's (those looking for a new job).
* In our graph, 20 hours of training is the most common value, and we see no big difference in training hours between relevant and non-relevant experience.
* Observation: people who have 20+ years of experience, or 3 ~ 6 years, are very interested in changing jobs. The main difference between targets 0 and 1 is the size of the 20+ group: it is the majority for target 0, while for target 1 the 20+ and 3 ~ 6 groups are almost equal.
ggplot(data, aes(x=gender))+
geom_bar(fill = "#D66BA0")+
geom_text(stat='count', aes(label=..count..), vjust=-1)+
ggtitle("Gender Distribution")+
theme(plot.title = element_text(hjust = 0.5))+
xlab("Gender")
ggplot(data,aes(x=company_size))+
geom_bar(fill = "#D66BA0")+
geom_text(stat='count', aes(label=..count..), vjust=-1)+
theme(plot.title = element_text(hjust = 0.5))+
ggtitle("Company size")+
xlab("Size")+
ylab("Count")
ggplot(data,aes(x=experience))+
geom_bar(fill = "#D66BA0")+
geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
theme(plot.title = element_text(hjust = 0.5))+
ggtitle("Experience")+
xlab("Experience")+
ylab("Count")
ggplot(data, aes(x = relevent_experience))+
  geom_bar(fill = "#D66BA0")+
  geom_text(stat = 'count', aes(label = ..count..))+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("Relevant Experience")+
  xlab("Relevant experience")+
  ylab("Count")
ggplot(data,aes(x=relevent_experience))+
geom_bar(fill = "#D66BA0")+
facet_wrap(~education_level)+
ggtitle("Relevent experience by education level")+
theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
theme(plot.title = element_text(hjust = 0.5))+
xlab("Experience")+
ylab("Count")
ggplot(data,aes(x=enrolled_university,fill = relevent_experience))+
geom_bar()+
facet_wrap(~relevent_experience)+
theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
theme(plot.title = element_text(hjust = 0.5))+
geom_text(stat='count', aes(label=..count..), vjust=-1)+
ggtitle("Enrolled University")+
xlab("Enrolled University")+
ylab("Count")
ggplot(data,aes(x=major_discipline,fill=relevent_experience))+
geom_bar()+
facet_wrap(~gender)+
theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
theme(plot.title = element_text(hjust = 0.5))+
geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
ggtitle("College major")+
xlab("Major")+
ylab("Count")
ggplot(data,aes(x=experience))+
geom_bar(fill = "#D66BA0")+
facet_wrap(~relevent_experience)+
theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
theme(plot.title = element_text(hjust = 0.5))+
ggtitle("Count by relevent experience")+
xlab("Experience")+
ylab("Count")
ggplot(data,aes(x=city_development_index,fill = relevent_experience))+
geom_bar()+
theme(plot.title = element_text(hjust = 0.5))+
ggtitle("City development index")+
xlab("City development")+
ylab("Count")
ggplot(data,aes(x= training_hours,fill = relevent_experience))+
geom_density()
ggplot(data,aes(x= training_hours,fill = relevent_experience))+
geom_density()+
facet_wrap(~relevent_experience)+
theme(plot.title = element_text(hjust = 0.5))+
ggtitle("Training hours by relevent experience")+
xlim(0,60)
ggplot(data,aes(x=education_level,fill = education_level))+
geom_bar()+
facet_wrap(~target)+
geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
ggtitle("Target by education level")+
xlab("Target")+
ylab("Count")
ggplot(data,aes(x=major_discipline,fill = major_discipline))+
geom_bar()+
facet_wrap(~target)+
theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
theme(plot.title = element_text(hjust = 0.5))+
ggtitle("Target by Major discipline")+
xlab("Target")+
ylab("Count")
ggplot(data,aes(x=city_development_index,fill = relevent_experience))+
geom_density(alpha = 0.5)+
facet_wrap(~target)+
theme(plot.title = element_text(hjust = 0.5))+
ggtitle("Target by city development index")+
xlab("Target")+
ylab("Count")
ggplot(data,aes(x=enrolled_university,fill = enrolled_university))+
geom_bar()+
facet_wrap(~target)+
theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
theme(plot.title = element_text(hjust = 0.5))+
ggtitle("Target by enrolled university")+
xlab("Target")+
ylab("Count")
Regarding outliers, we take the numerical variable training_hours to see whether it is balanced or whether there are many outliers that would affect our computations:
ggplot(data)+
geom_boxplot(mapping = aes(x= gender,y= training_hours,fill=gender))+
coord_flip()+
ggtitle("Hours of training by gender")
The conclusion we can extract from this boxplot is that there is a lot of variability in our data: there are no clearly isolated outliers, because many cases fall outside the expected limits, leaving the data unbalanced. The SMOTE method could balance the data by oversampling, but it is not available for all versions of R.
# city was already dropped earlier, so we convert the remaining columns
data$relevent_experience = as.numeric(data$relevent_experience)
data$gender = as.numeric(data$gender)
data$experience = as.numeric(data$experience)
data$last_new_job = as.numeric(data$last_new_job)
data$training_hours = as.numeric(data$training_hours)
data$target = as.numeric(data$target)
data$education_level = as.numeric(data$education_level)
data$major_discipline = as.numeric(data$major_discipline)
data$enrolled_university = as.numeric(data$enrolled_university)
# gsub() left company_size as character, so factor it again before converting
data$company_size = as.numeric(as.factor(data$company_size))
corrplot(cor(data), is.corr = F,
number.cex = 0.4, tl.cex = 0.4, cl.cex = 0.4)
We see that the pairs showing the strongest correlations are target with city_development_index, and last_new_job with relevent_experience.
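The strongest pair can also be read off programmatically rather than by eye; a sketch on a toy correlation matrix (toy data, not our dataset):

```r
# Sketch: find the most correlated variable pair in a correlation matrix
set.seed(7)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.1)  # built to correlate strongly with a
toy$c <- rnorm(50)
cm <- cor(toy)
cm[upper.tri(cm, diag = TRUE)] <- NA  # keep each off-diagonal pair once
which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)  # the (b, a) pair
```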
pca = prcomp(data, scale = T, center = T)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3516 1.1289 1.04181 1.01213 0.99899 0.98635 0.94746
## Proportion of Variance 0.1661 0.1159 0.09867 0.09313 0.09073 0.08844 0.08161
## Cumulative Proportion 0.1661 0.2819 0.38060 0.47373 0.56445 0.65290 0.73450
## PC8 PC9 PC10 PC11
## Standard deviation 0.93341 0.89518 0.81149 0.76768
## Proportion of Variance 0.07921 0.07285 0.05987 0.05358
## Cumulative Proportion 0.81371 0.88656 0.94642 1.00000
I have created a table with the importance of each component in the total representation; now let's check that the new database is uncorrelated and look at the percentage of variance explained by each PC. The percentage explained by a component is the variance of that PC divided by the sum of the variances of all the PCs.
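We can verify this formula directly from the standard deviations printed by summary(pca) above:

```r
# Standard deviations copied from summary(pca) above
sdev <- c(1.3516, 1.1289, 1.04181, 1.01213, 0.99899, 0.98635, 0.94746,
          0.93341, 0.89518, 0.81149, 0.76768)
round(sdev[1]^2 / sum(sdev^2), 4)  # 0.1661, matching PC1's proportion of variance
```

Note that the variances sum to 11, the number of variables, as expected for a PCA on scaled data.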
corrplot(cor(pca$x), is.corr = F,
number.cex = 0.4, tl.cex = 0.4, cl.cex = 0.4)
fviz_screeplot(pca, addlabels = TRUE)
variances = data.frame(Dimensions = 1:11,
Percentage = pca$sdev^2/sum(pca$sdev^2)*100)
ggplot(variances)+aes(x = Dimensions, y = Percentage)+
geom_bar(stat = "identity", fill = "#5B84B1FF")+
geom_point()+geom_line()+ theme_minimal() +
scale_x_continuous(breaks = 1:10) +
labs(title = "Scree plot") + theme(text = element_text(size = 6)) +
ylab("Percentage of explained variance")
We can visualize the quality of the representation of the variables with cos2, which is computed from the correlation between the original variables and their projections.
fviz_pca_var(pca, col.var="cos2",repel = TRUE, labelsize = 2,
gradient.cols = c("blue", "orange"))+
theme(text = element_text(size = 6))
The main conclusion we can extract from our PCA is that there are no very strong relations, and that we can explain about half of the variance with the first four components. What happens if we run the PCA again with explicit scaling?
pca2 = prcomp(data, center = T, scale. = T); summary(pca2)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3516 1.1289 1.04181 1.01213 0.99899 0.98635 0.94746
## Proportion of Variance 0.1661 0.1159 0.09867 0.09313 0.09073 0.08844 0.08161
## Cumulative Proportion 0.1661 0.2819 0.38060 0.47373 0.56445 0.65290 0.73450
## PC8 PC9 PC10 PC11
## Standard deviation 0.93341 0.89518 0.81149 0.76768
## Proportion of Variance 0.07921 0.07285 0.05987 0.05358
## Cumulative Proportion 0.81371 0.88656 0.94642 1.00000
fviz_screeplot(pca2, addlabels = TRUE)
#comparison of both pca and pca2
barplot(pca$rotation[,1], las=2, col="#E16036")
barplot(pca2$rotation[,1], las=2, col="#2E294E")
As we see, the two decompositions are identical: in the first call, scale = T partially matches prcomp's scale. argument, so both PCAs were already computed on centered, scaled data, and the same variables carry the same importance. In some databases, transforming factors to numeric codes can make scaling questionable; on the other hand, whenever the variables are measured in different units, scaling helps them fit the correlation circle better.
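A tiny synthetic example (toy data, not our dataset) shows why scaling matters when units differ:

```r
# Sketch: a variable with much larger units dominates an unscaled PCA
set.seed(1)
toy <- data.frame(small = rnorm(100),        # unit scale
                  big   = rnorm(100) * 100)  # a hundred times larger units
p_raw    <- prcomp(toy, center = TRUE, scale. = FALSE)
p_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)
p_raw$sdev^2 / sum(p_raw$sdev^2)        # PC1 takes almost all the variance
p_scaled$sdev^2 / sum(p_scaled$sdev^2)  # roughly 50/50 after scaling
```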
Secondly, let's create another database by discarding less important variables (based on the graphs) and see if we reduce noise:
data3 = select(data, -company_size, -training_hours, -education_level)
pca3 = prcomp(data3, center = T, scale. = F); summary(pca3)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 7.1628 1.57112 0.94865 0.52672 0.42187 0.38559 0.26679
## Proportion of Variance 0.9267 0.04459 0.01626 0.00501 0.00321 0.00269 0.00129
## Cumulative Proportion 0.9267 0.97132 0.98757 0.99259 0.99580 0.99849 0.99977
## PC8
## Standard deviation 0.11254
## Proportion of Variance 0.00023
## Cumulative Proportion 1.00000
corrplot(cor(data3), is.corr = F,
number.cex = 0.4, tl.cex = 0.4, cl.cex = 0.4)
fviz_screeplot(pca3, addlabels = TRUE)
In this case, the results are totally different: with only the first three components we already have 98.8% of the information.
barplot(pca3$rotation[,1], las=2, col="#0FFF95")
In this case, it is clear that relevent_experience is the most important variable for the model, followed by last_new_job and enrolled_university. The relations observed in our first two models are more or less preserved in this third one (in which we tried to reduce noise).
pca4 = prcomp(data3, center = T, scale. = T); summary(pca4)
fviz_screeplot(pca4, addlabels = TRUE)
This PCA is more sensitive to scaling: this time the weights of the variables are more evenly distributed across the first components, which is interesting. This way, we eliminate the balance issue of the database.
barplot(pca4$rotation[,1], las=2, col="#0FFF95")
Factor analysis is a statistical data-reduction technique used to explain the correlations between observed variables in terms of a smaller number of unobserved variables called factors. When comparing factor analysis with other techniques, it is vital to know that its restrictions on linearity and normality are stricter. Our first concern is how many factors we should use; since I have been keeping track all along of how much information the model represents in the first three components, I will use 3 by default.
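A common sanity check for that choice (not part of the original analysis) is the Kaiser criterion: keep factors whose correlation-matrix eigenvalue exceeds 1. A sketch on synthetic data with two planted factors:

```r
# Kaiser criterion sketch: synthetic data generated from two latent factors
set.seed(42)
f1 <- rnorm(200); f2 <- rnorm(200)
toy <- cbind(f1 + rnorm(200, sd = 0.3), f1 + rnorm(200, sd = 0.3),
             f2 + rnorm(200, sd = 0.3), f2 + rnorm(200, sd = 0.3))
ev <- eigen(cor(toy))$values
sum(ev > 1)  # 2: the heuristic recovers the two planted factors
```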
fanal1 = factanal(data, 3, rotation = "varimax", scores = "regression")
fanal1
##
## Call:
## factanal(x = data, factors = 3, scores = "regression", rotation = "varimax")
##
## Uniquenesses:
## city_development_index gender relevent_experience
## 0.685 0.998 0.390
## enrolled_university education_level major_discipline
## 0.834 0.973 0.948
## experience company_size last_new_job
## 0.860 0.931 0.832
## training_hours target
## 0.999 0.005
##
## Loadings:
## Factor1 Factor2 Factor3
## city_development_index -0.283 -0.480
## gender
## relevent_experience 0.780
## enrolled_university -0.362 -0.177
## education_level -0.118
## major_discipline 0.228
## experience 0.134 0.349
## company_size 0.247
## last_new_job 0.360 0.195
## training_hours
## target 0.984 0.116 0.115
##
## Factor1 Factor2 Factor3
## SS loadings 1.068 0.973 0.505
## Proportion Var 0.097 0.088 0.046
## Cumulative Var 0.097 0.186 0.231
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 381.76 on 25 degrees of freedom.
## The p-value is 1.67e-65
fanal2 = factanal(data, 3, rotation = "none", scores = "regression")
fanal2
##
## Call:
## factanal(x = data, factors = 3, scores = "regression", rotation = "none")
##
## Uniquenesses:
## city_development_index gender relevent_experience
## 0.685 0.998 0.390
## enrolled_university education_level major_discipline
## 0.834 0.973 0.948
## experience company_size last_new_job
## 0.860 0.931 0.832
## training_hours target
## 0.999 0.005
##
## Loadings:
## Factor1 Factor2 Factor3
## city_development_index -0.343 0.437
## gender
## relevent_experience 0.130 0.762 0.111
## enrolled_university -0.122 -0.368 0.124
## education_level 0.118
## major_discipline -0.226
## experience 0.168 -0.327
## company_size 0.102 0.229
## last_new_job 0.379 -0.151
## training_hours
## target 0.997
##
## Factor1 Factor2 Factor3
## SS loadings 1.170 0.954 0.422
## Proportion Var 0.106 0.087 0.038
## Cumulative Var 0.106 0.193 0.231
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 381.76 on 25 degrees of freedom.
## The p-value is 1.67e-65
cbind(fanal1$loadings, fanal1$uniquenesses)
## Factor1 Factor2 Factor3
## city_development_index -0.28326449 -0.066857440 -0.47974019 0.6851434
## gender -0.01303137 0.006317856 0.04658683 0.9976168
## relevent_experience 0.03988904 0.780056966 -0.01178520 0.3897815
## enrolled_university -0.05931015 -0.362004784 -0.17688471 0.8341446
## education_level -0.08096397 0.080505905 -0.11821733 0.9729891
## major_discipline -0.01365783 0.004182644 0.22768405 0.9479514
## experience 0.01377505 0.133748076 0.34899536 0.8601227
## company_size 0.07890051 0.247431743 -0.04508652 0.9305236
## last_new_job -0.02074055 0.360125061 0.19511291 0.8318011
## training_hours -0.02201464 -0.014295825 0.01519208 0.9990832
## target 0.98402823 0.116063458 0.11496846 0.0050000
cbind(fanal2$loadings, fanal2$uniquenesses)
## Factor1 Factor2 Factor3
## city_development_index -0.343198908 -0.080846443 0.436506244 0.6851434
## gender -0.006668926 0.012987558 -0.046550120 0.9976168
## relevent_experience 0.130161918 0.762190626 0.111095547 0.3897815
## enrolled_university -0.121909998 -0.368158317 0.124301108 0.8341446
## education_level -0.084114978 0.077107642 0.118282651 0.9729891
## major_discipline 0.013587355 0.030616715 -0.225659161 0.9479514
## experience 0.070105669 0.167865929 -0.326775589 0.8601227
## company_size 0.101791249 0.228602392 0.082825361 0.9305236
## last_new_job 0.044875196 0.378766910 -0.150702098 0.8318011
## training_hours -0.021626402 -0.009556137 -0.018994553 0.9990832
## target 0.997494132 -0.002014260 0.001193804 0.0050000
Observation: Factor 1 is clearly related to target (communality around 99%), Factor 2 to relevent_experience; Factor 3 is not strongly correlated with any variable, but it correlates most with city_development_index.
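The ~99% figure for target can be checked by hand: a variable's communality is the sum of its squared loadings, which equals 1 minus its uniqueness. Using the fanal1 numbers printed above:

```r
# target's loadings on the three factors, copied from the fanal1 output above
loadings_target <- c(0.98402823, 0.116063458, 0.11496846)
round(sum(loadings_target^2), 3)  # 0.995 = 1 - uniqueness (0.005)
```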
barplot(fanal1$loadings[,1], las=2, col="#85BAA1", ylim = c(-1, 1))
barplot(fanal1$loadings[,2], las=2, col="#F4E04D", ylim = c(-1, 1))
barplot(fanal1$loadings[,3], las=2, col="#D4CBE5", ylim = c(-1, 1))
barplot(fanal2$loadings[,1], las=2, col="#AD5D4E", ylim = c(-1, 1))
barplot(fanal2$loadings[,2], las=2, col="#40476D", ylim = c(-1, 1))
barplot(fanal2$loadings[,3], las=2, col="#826754", ylim = c(-1, 1))
Insights: fanal1 gives results similar to the PCA, but in general the weights are less concentrated and more evenly spread across the variables. What happens if we try other factoring methods?
#fm = ml -> maximum likelihood
#fm = pa -> principal axis factor analysis
#fm = minres -> minimum residual
fanal3= fa(r = data, 3, rotate = "none", scores = "regression", fm = "ml", residuals = T)
fanal3
## Factor Analysis using method = ml
## Call: fa(r = data, nfactors = 3, rotate = "none", scores = "regression",
## residuals = T, fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML2 ML3 h2 u2 com
## city_development_index -0.34 -0.08 0.44 0.31489 0.685 2.0
## gender -0.01 0.01 -0.05 0.00238 0.998 1.2
## relevent_experience 0.13 0.76 0.11 0.61024 0.390 1.1
## enrolled_university -0.12 -0.37 0.12 0.16585 0.834 1.5
## education_level -0.08 0.08 0.12 0.02701 0.973 2.6
## major_discipline 0.01 0.03 -0.23 0.05204 0.948 1.0
## experience 0.07 0.17 -0.33 0.13986 0.860 1.6
## company_size 0.10 0.23 0.08 0.06948 0.931 1.7
## last_new_job 0.04 0.38 -0.15 0.16818 0.832 1.3
## training_hours -0.02 -0.01 -0.02 0.00092 0.999 2.4
## target 1.00 0.00 0.00 0.99500 0.005 1.0
##
## ML1 ML2 ML3
## SS loadings 1.17 0.95 0.42
## Proportion Var 0.11 0.09 0.04
## Cumulative Var 0.11 0.19 0.23
## Proportion Explained 0.46 0.37 0.17
## Cumulative Proportion 0.46 0.83 1.00
##
## Mean item complexity = 1.6
## Test of the hypothesis that 3 factors are sufficient.
##
## The degrees of freedom for the null model are 55 and the objective function was 0.49 with Chi Square of 9426.08
## The degrees of freedom for the model are 25 and the objective function was 0.02
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## The harmonic number of observations is 19158 with the empirical chi square 635.46 with prob < 4.6e-118
## The total number of observations was 19158 with Likelihood Chi Square = 381.76 with prob < 1.7e-65
##
## Tucker Lewis Index of factoring reliability = 0.916
## RMSEA index = 0.027 and the 90 % confidence intervals are 0.025 0.03
## BIC = 135.25
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy
## ML1 ML2 ML3
## Correlation of (regression) scores with factors 1.00 0.81 0.60
## Multiple R square of scores with factors 1.00 0.66 0.36
## Minimum correlation of possible factor scores 0.99 0.32 -0.28
fanal4 = fa(r = data, 3, rotate = "none", scores = "regression", fm = "pa", residuals = T)
fanal4
## Factor Analysis using method = pa
## Call: fa(r = data, nfactors = 3, rotate = "none", scores = "regression",
## residuals = T, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 PA3 h2 u2 com
## city_development_index -0.44 0.36 0.23 0.3757 0.62 2.5
## gender 0.02 0.00 -0.05 0.0032 1.00 1.2
## relevent_experience 0.58 0.48 0.12 0.5834 0.42 2.0
## enrolled_university -0.37 -0.14 0.06 0.1626 0.84 1.3
## education_level -0.04 0.16 0.04 0.0277 0.97 1.2
## major_discipline 0.10 -0.04 -0.20 0.0519 0.95 1.5
## experience 0.25 -0.03 -0.26 0.1322 0.87 2.0
## company_size 0.21 0.14 0.12 0.0785 0.92 2.3
## last_new_job 0.34 0.21 -0.15 0.1789 0.82 2.1
## training_hours -0.02 0.00 -0.03 0.0013 1.00 1.6
## target 0.55 -0.46 0.30 0.6067 0.39 2.5
##
## PA1 PA2 PA3
## SS loadings 1.20 0.68 0.31
## Proportion Var 0.11 0.06 0.03
## Cumulative Var 0.11 0.17 0.20
## Proportion Explained 0.55 0.31 0.14
## Cumulative Proportion 0.55 0.86 1.00
##
## Mean item complexity = 1.9
## Test of the hypothesis that 3 factors are sufficient.
##
## The degrees of freedom for the null model are 55 and the objective function was 0.49 with Chi Square of 9426.08
## The degrees of freedom for the model are 25 and the objective function was 0.02
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## The harmonic number of observations is 19158 with the empirical chi square 644.28 with prob < 6.6e-120
## The total number of observations was 19158 with Likelihood Chi Square = 407.51 with prob < 9e-71
##
## Tucker Lewis Index of factoring reliability = 0.91
## RMSEA index = 0.028 and the 90 % confidence intervals are 0.026 0.031
## BIC = 161
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy
## PA1 PA2 PA3
## Correlation of (regression) scores with factors 0.83 0.77 0.58
## Multiple R square of scores with factors 0.69 0.59 0.33
## Minimum correlation of possible factor scores 0.39 0.18 -0.33
fanal5 = fa(r = data, 3, rotate = "none", scores = "regression", fm = "minres", residuals = T)
fanal5
## Factor Analysis using method = minres
## Call: fa(r = data, nfactors = 3, rotate = "none", scores = "regression",
## residuals = T, fm = "minres")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 MR3 h2 u2 com
## city_development_index -0.42 0.12 0.35 0.31648 0.6835 2.1
## gender 0.01 0.01 -0.05 0.00250 0.9975 1.3
## relevent_experience 0.46 0.59 0.19 0.59740 0.4026 2.1
## enrolled_university -0.31 -0.25 0.07 0.16369 0.8363 2.0
## education_level -0.06 0.12 0.10 0.02737 0.9726 2.5
## major_discipline 0.08 0.02 -0.20 0.04793 0.9521 1.3
## experience 0.21 0.10 -0.29 0.14131 0.8587 2.1
## company_size 0.18 0.16 0.12 0.07398 0.9260 2.7
## last_new_job 0.26 0.30 -0.11 0.17081 0.8292 2.2
## training_hours -0.02 0.00 -0.02 0.00095 0.9990 1.9
## target 0.83 -0.51 0.21 0.99550 0.0045 1.8
##
## MR1 MR2 MR3
## SS loadings 1.35 0.82 0.38
## Proportion Var 0.12 0.07 0.03
## Cumulative Var 0.12 0.20 0.23
## Proportion Explained 0.53 0.32 0.15
## Cumulative Proportion 0.53 0.85 1.00
##
## Mean item complexity = 2
## Test of the hypothesis that 3 factors are sufficient.
##
## The degrees of freedom for the null model are 55 and the objective function was 0.49 with Chi Square of 9426.08
## The degrees of freedom for the model are 25 and the objective function was 0.02
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## The harmonic number of observations is 19158 with the empirical chi square 623.7 with prob < 1.3e-115
## The total number of observations was 19158 with Likelihood Chi Square = 387.54 with prob < 1.1e-66
##
## Tucker Lewis Index of factoring reliability = 0.915
## RMSEA index = 0.028 and the 90 % confidence intervals are 0.025 0.03
## BIC = 141.03
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy
## MR1 MR2 MR3
## Correlation of (regression) scores with factors 0.94 0.86 0.62
## Multiple R square of scores with factors 0.88 0.74 0.39
## Minimum correlation of possible factor scores 0.76 0.48 -0.23
cbind(fanal3$loadings, fanal3$uniquenesses)
## ML1 ML2 ML3
## city_development_index -0.343199011 -0.080843032 0.436537369 0.685113968
## gender -0.006668923 0.012987000 -0.046550366 0.997619927
## relevent_experience 0.130161988 0.762202658 0.111086313 0.389764797
## enrolled_university -0.121910020 -0.368153001 0.124303483 0.834149959
## education_level -0.084114975 0.077108247 0.118276328 0.972989699
## major_discipline 0.013587365 0.030613175 -0.225654510 0.947958259
## experience 0.070105684 0.167858998 -0.326759640 0.860136688
## company_size 0.101791260 0.228601315 0.082821842 0.930520521
## last_new_job 0.044875222 0.378758785 -0.150702942 0.831816621
## training_hours -0.021626403 -0.009556248 -0.018993591 0.999080220
## target 0.997494132 -0.002014322 0.001193931 0.004999973
cbind(fanal4$loadings, fanal4$uniquenesses)
## PA1 PA2 PA3
## city_development_index -0.44066277 0.3578821540 0.23120813 0.6242795
## gender 0.01662688 0.0005612527 -0.05407884 0.9967987
## relevent_experience 0.57983945 0.4824826929 0.12001299 0.4165935
## enrolled_university -0.37450520 -0.1363938701 0.06118041 0.8373995
## education_level -0.04043568 0.1571573281 0.03696811 0.9722999
## major_discipline 0.09543141 -0.0448847006 -0.20194647 0.9480958
## experience 0.25179142 -0.0303237737 -0.26050323 0.8678196
## company_size 0.21406909 0.1350776827 0.11991156 0.9215497
## last_new_job 0.33797945 0.2074958194 -0.14709793 0.8210776
## training_hours -0.01723893 0.0018227521 -0.03118792 0.9987268
## target 0.54732911 -0.4638655534 0.30324810 0.3933002
cbind(fanal5$loadings, fanal5$uniquenesses)
## MR1 MR2 MR3
## city_development_index -0.42051456 0.121989344 0.35321607 0.683524508
## gender 0.01100262 0.013265017 -0.04689016 0.997504295
## relevent_experience 0.46406157 0.587818416 0.19108736 0.402601990
## enrolled_university -0.31378576 -0.245361912 0.07086068 0.836314791
## education_level -0.05877842 0.115014904 0.10339234 0.972626693
## major_discipline 0.07813839 0.021215458 -0.20341380 0.952067123
## experience 0.21436123 0.095003478 -0.29383233 0.858686162
## company_size 0.18348813 0.160843943 0.12018643 0.926016552
## last_new_job 0.26117817 0.302597940 -0.10504208 0.829186613
## training_hours -0.01859798 0.001168926 -0.02464712 0.999045268
## target 0.83467318 -0.506076241 0.20665411 0.004501593
Interpretation: Analysing these tables in detail, the highest communality (h2) in every fit belongs to target, followed by relevent_experience, city_development_index and last_new_job. A possible interpretation is that in highly developed cities data scientists more often obtain better conditions, which increases job turnover (last_new_job) because the market that demands them is considerably more competitive than in a city with a lower development index (and fewer technological breakthroughs). Moreover, highly developed cities tend to offer more stable positions in which to gain experience in the field.
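To see where the h2 and u2 columns above come from, here is an illustrative sketch on a hypothetical two-row loadings matrix (values copied from the rounded fanal4 table, so the results differ slightly from the unrounded output): a variable's communality is the sum of its squared loadings across the retained factors, and its uniqueness is the remainder.

```r
# Communality h2 = sum of squared loadings; uniqueness u2 = 1 - h2.
# Rounded loadings taken from the fanal4 table above, for illustration only.
L <- rbind(
  relevent_experience = c(0.58, 0.48, 0.12),
  training_hours      = c(-0.02, 0.00, -0.03)
)
h2 <- rowSums(L^2)   # communalities
u2 <- 1 - h2         # uniquenesses
round(cbind(h2, u2), 4)
```

Note how training_hours, with near-zero loadings everywhere, gets u2 close to 1: the factors explain almost none of its variance.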
barplot(fanal3$loadings[,1], las=2, col="#85BAA1", ylim = c(-1, 1))
barplot(fanal3$loadings[,2], las=2, col="#F4E04D", ylim = c(-1, 1))
barplot(fanal3$loadings[,3], las=2, col="#D4CBE5", ylim = c(-1, 1))
barplot(fanal4$loadings[,1], las=2, col="#AD5D4E", ylim = c(-1, 1))
barplot(fanal4$loadings[,2], las=2, col="#40476D", ylim = c(-1, 1))
barplot(fanal4$loadings[,3], las=2, col="#826754", ylim = c(-1, 1))
barplot(fanal5$loadings[,1], las=2, col="#2589BD", ylim = c(-1, 1))
barplot(fanal5$loadings[,2], las=2, col="#38686A", ylim = c(-1, 1))
barplot(fanal5$loadings[,3], las=2, col="#B9F5D8", ylim = c(-1, 1))
As expected, the three extraction methods spread the loadings differently: with maximum likelihood the loadings are smaller but more concentrated, so most of the explained variance is captured by two factors, whereas with minimum residual this is not possible because each variable is only weakly represented. In my opinion, fanal3 and fanal4 lead to a better understanding of the data.
fa.diagram(fanal3); fa.diagram(fanal4); fa.diagram(fanal5)
NOTE: The remaining procedures are very time- and memory-consuming, and my computer cannot handle them on such a big database. From now on I will comment out the heaviest chunks so that this file can be exported.
Clustering is the process of grouping abstract objects into classes of similar objects. We will start with K-means (on scaled data), which proceeds as follows: 1. select k centroids at random; 2. assign each point to its nearest centroid; 3. recalculate each centroid as the mean of its assigned points; 4. repeat steps 2 and 3 until the centroids no longer change.
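The four steps above can be sketched from scratch in a few lines of base R. This runs on small synthetic data (two well-separated Gaussian groups), not on the HR dataset, and is only meant to make the algorithm concrete; for real work we use kmeans() as below.

```r
# From-scratch k-means on synthetic data: two Gaussian blobs in 2D.
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
k <- 2
centers <- X[sample(nrow(X), k), ]                      # 1. random centroids
for (it in 1:100) {
  # squared distance of every point to every centroid (n x k matrix)
  d2 <- sapply(1:k, function(j) colSums((t(X) - centers[j, ])^2))
  cl <- max.col(-d2)                                    # 2. nearest centroid
  new_centers <- t(sapply(1:k, function(j)
    colMeans(X[cl == j, , drop = FALSE])))              # 3. recompute means
  if (any(is.nan(new_centers))) {                       # empty cluster: restart
    centers <- X[sample(nrow(X), k), ]; next
  }
  if (all(abs(new_centers - centers) < 1e-10)) break    # 4. converged
  centers <- new_centers
}
table(cl)   # cluster sizes
```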
scaled_data = scale(data)
k1 = kmeans(scaled_data, centers = 5, nstart = 100)
k1_cluster = k1$cluster
k1_centers = k1$centers
#graphical representation
barplot(table(k1_cluster), col="#858AE3", xlab = "Cluster", ylab = "Observations")
Now we can plot the position of the cluster center against the global center (approximated here by the per-variable median):
# plotting centers in cluster 1
bar1 = barplot(k1_centers[1,], las=2, col="#AAF683",ylim = c(-2,2),
main= paste("Cluster 1: center (green) and global center (pink)"))
points(bar1, y = apply(scaled_data, 2, quantile, 0.50),col="#E85D75", pch=19)
#plotting the clusters
fviz_cluster(k1, data = scaled_data, labelsize = 6) +
theme(text = element_text(size = 6))
Question: what is the optimal number of clusters? Let’s find out.
#fviz_nbclust(scaled_data, kmeans, method = 'wss')
Applying the elbow rule to the resulting curve, we conclude the ideal number of clusters is 3.
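The commented fviz_nbclust call draws exactly this curve; it can also be computed by hand, since the elbow method just plots the total within-cluster sum of squares against k. Shown here on the built-in iris measurements so it knits quickly; swap in scaled_data to reproduce the real curve.

```r
# Manual elbow curve: total WSS for k = 1..8 on scaled iris measurements.
sc  <- scale(iris[, 1:4])
wss <- sapply(1:8, function(k) kmeans(sc, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```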
#fviz_nbclust(scaled_data, kmeans, method = 'silhouette')
#fviz_nbclust(scaled_data, kmeans, method = 'gap_stat')
Applying the silhouette method, we again conclude that three is the ideal number of clusters.
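What the silhouette method optimises can also be computed directly: for each k it averages, over all points, how much closer each point is to its own cluster than to the nearest other cluster. A small sketch with cluster::silhouette (the cluster package ships with R), again on iris for speed:

```r
# Average silhouette width for a single k, computed by hand on iris.
library(cluster)
sc  <- scale(iris[, 1:4])
km  <- kmeans(sc, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(sc))
mean(sil[, "sil_width"])   # the quantity fviz_nbclust maximises over k
```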
k2 = kmeans(scaled_data, centers = 3, nstart = 100)
k2_clusters = k2$cluster
k2_centers = k2$centers
bar2 = barplot(k2_centers[1,], las= 2, col="#214E34",ylim = c(-5,5), xlab = "Appropriate cluster", ylab = "appropriate observations")
points(bar2 ,y = apply(scaled_data, 2, quantile, 0.5), col ="#C6CCB2", pch = 20 )
We have a better-characterised cluster and a worse one, depending on how far its center deviates from the global center:
b2 = barplot(k2_centers[3,], las = 2, col="#FFC15E",ylim = c(-2,2), xlab = "Inappropriate cluster", ylab = "inappropriate observations")
points(b2, y = apply(scaled_data, 2, quantile, 0.5), col ="#BF4E30", pch = 20)
Cluster plot:
fviz_cluster(k2, data = scaled_data, repel = F, labelsize = 1, ellipse.type = "norm") + theme_light()
There are three groups, one of them lying between the first and the third, which are more clearly differentiated. That may be why gap_stat suggested a different number of clusters for our model.
k2_min = eclust(scaled_data, "kmeans", stand=T, k=3, graph = T, hc_metric = "minkowski")
Three clusters with an overlap; what if we try two clusters?
k2_min2 = eclust(scaled_data, "kmeans", stand=T, k=2, graph = T, hc_metric = "minkowski")
The intersection is smaller now. Next, let’s try the Mahalanobis distance:
#AUTHOR NOTE: very time consuming
#S_x = cov(data)
#iS = solve(S_x)
#e = eigen(iS)
#V = e$vectors
#B = V %*% diag(sqrt(e$values)) %*% t(V)
#Xtil = scale(data,scale = FALSE)
#dataS = Xtil %*% B
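The idea behind the commented chunk can be verified on a small synthetic matrix: after multiplying the centred data by B = S^(-1/2), the sample covariance becomes the identity, so plain Euclidean distances on dataS are Mahalanobis distances on the original data. This is only an illustrative check, not the real computation on our dataset.

```r
# Whitening check: cov of the transformed data should be the identity.
set.seed(2)
X  <- matrix(rnorm(300), ncol = 3) %*% rbind(c(2, 0, 0), c(1, 1, 0), c(0, 1, 1))
S  <- cov(X)
iS <- solve(S); iS <- (iS + t(iS)) / 2                    # symmetrise numerically
e  <- eigen(iS, symmetric = TRUE)
B  <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors) # B = S^(-1/2)
Xw <- scale(X, scale = FALSE) %*% B                       # whitened data
round(cov(Xw), 10)                                        # ~ 3x3 identity
```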
Now we create our k-means models (we try with both 2 and 3 clusters to compare):
#k2_mah = kmeans(dataS, centers=3, nstart=100)
#k2_mah2 = kmeans(dataS, centers=2, nstart=100)
#graphically:
#fviz_cluster(k2_mah, data = scaled_data, stand = T, ellipse = T, labelsize = 1)
#fviz_cluster(k2_mah2, data = scaled_data, stand = T, ellipse = T, labelsize = 1)
How much do the two clusterings agree? (answer: very little)
#adjustedRandIndex(k2_mah$cluster, k2_mah2$cluster)
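What adjustedRandIndex measures can be seen from a toy base-R reimplementation of the ARI formula (the document itself uses mclust::adjustedRandIndex; this hypothetical helper is only for illustration): it compares all pairs of points and corrects for chance agreement, so merely relabelling the clusters does not change it.

```r
# Adjusted Rand Index from the contingency table of two labelings.
ari <- function(a, b) {
  tab   <- table(a, b)
  comb2 <- function(x) sum(choose(x, 2))   # pairs within each count
  index    <- comb2(tab)
  expected <- comb2(rowSums(tab)) * comb2(colSums(tab)) / choose(length(a), 2)
  maxindex <- (comb2(rowSums(tab)) + comb2(colSums(tab))) / 2
  (index - expected) / (maxindex - expected)
}
a <- c(1, 1, 1, 2, 2, 2)
b <- c(2, 2, 2, 1, 1, 1)   # identical partition, labels swapped
ari(a, b)                  # 1: perfect agreement despite different labels
```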
Let’s try other approaches such as “PAM” or hierarchical clustering:
Optimal number of clusters?
#AUTHOR NOTE: very time consuming
#fviz_nbclust(scaled_data, pam, method = 'wss', k.max = 20)
#AUTHOR NOTE: very time consuming
#fviz_nbclust(scaled_data, pam, method = 'silhouette', k.max = 20)
By the elbow method, 9 or 10; by the silhouette method, around 16.
#k3 = eclust(scaled_data, "pam", stand=T, k=16, graph = F)
#fviz_cluster(k3, data = scaled_data, stand = T, ellipse = T, labelsize = 1)
#silhouette plot for more info.
#fviz_silhouette(k3)
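What PAM does differently from k-means: each cluster center is a medoid, an actual observation, which makes it more robust to outliers. A small runnable sketch on iris (the cluster package ships with R); swap in scaled_data and k = 16 to reproduce the commented run above.

```r
# PAM on scaled iris: medoids are real rows of the data, not averaged centres.
library(cluster)
p <- pam(scale(iris[, 1:4]), k = 3)
table(p$clustering)   # cluster sizes
p$id.med              # row indices of the observations chosen as medoids
```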
Let’s see if the Minkowski distance helps avoid overlapping:
#k3_min1 = eclust(scaled_data, "pam", stand=T, k=16, graph = T, hc_metric = "minkowski")
#differences in clusters changing the distance method
#adjustedRandIndex(k3_min1$cluster, k3$cluster)
We confirm that the method chosen to compute distances matters.
Matrices of distances:
#matrix_d1 = dist(data, method = "euclidean")
#matrix_d2 = dist(data, method = "minkowski", p = 2)
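A quick sanity check (illustrated on iris rather than the full data): the Minkowski distance with p = 2 is exactly the Euclidean distance, so the two matrices above should be identical.

```r
# Minkowski p = 2 coincides with Euclidean distance.
d_eucl <- dist(iris[, 1:4], method = "euclidean")
d_mink <- dist(iris[, 1:4], method = "minkowski", p = 2)
all.equal(as.vector(d_eucl), as.vector(d_mink))   # TRUE
```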
Optimal no. of clusters?
#n1=NbClust(data, distance = "euclidean", min.nc=2, max.nc=8,
# method = "complete", index = "all")
The best number is 2. We try hierarchical clustering with the complete and ward.D2 linkages.
#k4_1 = hclust(matrix_d1, method = "complete")
#colors = color_branches(as.dendrogram(k4_1), k =2)
#plot(colors)
k = 3 is a very reliable option.
#cluster1 = cutree(k4_1, k = 2)
#table(cluster1)
#difference in clusters' size
#k4_2 = hclust(matrix_d1, method = "ward.D2")
#colors = color_branches(as.dendrogram(k4_2), k =3)
#plot(colors)
#cluster2 = cutree(k4_2, k = 3)
#table(cluster2)
#k4_2_v1 = hclust(matrix_d1, method = "ward.D2")
#colors = color_branches(as.dendrogram(k4_2_v1), k = 6)
#plot(colors)
#cluster2_v1 = cutree(k4_2_v1, k = 6)
#table(cluster2_v1)
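The same hclust + cutree pipeline can be run on a dataset small enough to knit (iris here); the structure mirrors the commented chunks above, and the same pattern of a few large clusters appears in the dendrogram.

```r
# Hierarchical clustering with Ward linkage, cut into 3 groups.
d  <- dist(scale(iris[, 1:4]), method = "euclidean")
hc <- hclust(d, method = "ward.D2")
plot(as.dendrogram(hc), leaflab = "none")
cl <- cutree(hc, k = 3)
table(cl)   # cluster sizes at k = 3
```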
There are two big clusters and then many small ones; k = 3 would be a good option, assuming all the small clusters merge into one big cluster.
Trying “complete” method:
#k4_3 = hclust(matrix_d2, method = "complete")
#colors = color_branches(as.dendrogram(k4_3), k = 2)
#plot(colors)
#cluster3 = cutree(k4_3, k = 2)
#table(cluster3)
Trying the “ward.D2” method:
#k4_4 = hclust(matrix_d2, method = "ward.D2")
#colors = color_branches(as.dendrogram(k4_4), k=3)
#plot(colors)
#cluster4 = cutree(k4_4, k = 3)
#table(cluster4)
Agreement between both clusterings:
#adjustedRandIndex(cluster2, cluster4)
#they are the same tree
#k4_5 = hclust(matrix_d2, method = "ward.D2")
#colors = color_branches(as.dendrogram(k4_5), k=6)
#plot(colors)
#cluster5 = cutree(k4_5, k = 6)
#table(cluster5)
Again, 3 or 4 clusters would fit well, so that no cluster holds too much information and none too little.
With this task we confirm that the study of a database, and the techniques practised here, are as variable and subjective as data science itself; yet even with this double-edged freedom, the amount of information we can extract thanks to Unsupervised Learning is huge, and this project is one example.